## Assignment 1 Hollow Man

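This assignment follows the method described at https://blog.csdn.net/weixin\_35433448/article/details/112102416 to estimate the parallel single-precision floating-point throughput of an FPGA, combining its DSP, LUT, and DFF resources. The formula is:

FPGA FLOPS = number of DSPs × DSP frequency + number of logic units × logic unit frequency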

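Refer to the official Xilinx Spartan-6 family overview:

https://china.xilinx.com/support/documentation/data\_sheets/ds160.pdf . Take the XC6SLX150T, which is among the devices with the most DSP slices, as the example. It has 180 DSP48A1 slices and 147,443 logic cells, of which an estimated 14,000 are set aside for I/O. One DSP48-based adder needs 2 DSP slices plus 289 LUT-FF pairs, and one logic-cell-based adder needs 517 logic cells (see the document cited under question (2)). The maximum FPGA frequency is taken as 390 MHz, and each clock cycle can perform 2 single-precision floating-point operations (a multiply and an add). The estimate is therefore:

[180/2 + (147443 - 14000 - 90 × 289)/517] × 390 MHz = 116.142 GFLOPS, far below the Intel i7 6900K and 6700K, so the i7 has the stronger floating-point capability.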
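The arithmetic above can be reproduced with a short Python sketch. It mirrors the expression exactly as written: the adder count is left fractional, logic cells and LUT-FF pairs are treated as interchangeable, and each adder is counted as one operation per cycle. The function name and parameter names are illustrative assumptions, not part of any Xilinx tool.

```python
def spartan6_gflops(
    dsp_slices=180,             # DSP48A1 slices on the XC6SLX150T
    logic_cells=147_443,        # logic cells on the XC6SLX150T
    io_cells=14_000,            # estimated logic cells reserved for I/O
    dsp_per_adder=2,            # DSP slices per DSP48-based adder
    lut_ff_per_dsp_adder=289,   # LUT-FF pairs per DSP48-based adder
    cells_per_logic_adder=517,  # logic cells per logic-only adder
    f_mhz=390.0,                # clock frequency used in the estimate
):
    """Return the estimated throughput in GFLOPS (hypothetical helper)."""
    dsp_adders = dsp_slices / dsp_per_adder
    # Fabric left after I/O and after the glue logic of the DSP-based adders.
    fabric_left = logic_cells - io_cells - dsp_adders * lut_ff_per_dsp_adder
    logic_adders = fabric_left / cells_per_logic_adder
    # adders x MHz gives MFLOPS; divide by 1000 for GFLOPS.
    return (dsp_adders + logic_adders) * f_mhz / 1000.0


print(f"{spartan6_gflops():.3f} GFLOPS")
```

Changing `io_cells` or `f_mhz` shows how sensitive the estimate is to the assumed I/O overhead and clock rate.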

Table 1: Spartan-6 FPGA Feature Summary by Device

| Device | Logic Cells <sup>(1)</sup> | CLB Slices <sup>(2)</sup> | CLB Flip-Flops | CLB Max Distributed RAM (Kb) | DSP48A1 Slices <sup>(3)</sup> | Block RAM Blocks, 18 Kb <sup>(4)</sup> | Block RAM Blocks, Max (Kb) | CMTs <sup>(5)</sup> | Memory Controller Blocks (Max) <sup>(6)</sup> | Endpoint Blocks for PCI Express | Maximum GTP Transceivers | Total I/O Banks | Max User I/O |
|------------|----------------------|----------------------------------|------------|--------------------------------|----------------------------------|----------------------|----------|---------------------|----------------------------------------------|---------------------------------------|---------------------|--------------|-------------|
| XC6SLX4    | 3,840                | 600                              | 4,800      | 75                             | 8                                | 12                   | 216      | 2                   | 0                                            | 0                                     | 0                   | 4            | 132         |
| XC6SLX9    | 9,152                | 1,430                            | 11,440     | 90                             | 16                               | 32                   | 576      | 2                   | 2                                            | 0                                     | 0                   | 4            | 200         |
| XC6SLX16   | 14,579               | 2,278                            | 18,224     | 136                            | 32                               | 32                   | 576      | 2                   | 2                                            | 0                                     | 0                   | 4            | 232         |
| XC6SLX25   | 24,051               | 3,758                            | 30,064     | 229                            | 38                               | 52                   | 936      | 2                   | 2                                            | 0                                     | 0                   | 4            | 266         |
| XC6SLX45   | 43,661               | 6,822                            | 54,576     | 401                            | 58                               | 116                  | 2,088    | 4                   | 2                                            | 0                                     | 0                   | 4            | 358         |
| XC6SLX75   | 74,637               | 11,662                           | 93,296     | 692                            | 132                              | 172                  | 3,096    | 6                   | 4                                            | 0                                     | 0                   | 6            | 408         |
| XC6SLX100  | 101,261              | 15,822                           | 126,576    | 976                            | 180                              | 268                  | 4,824    | 6                   | 4                                            | 0                                     | 0                   | 6            | 480         |
| XC6SLX150  | 147,443              | 23,038                           | 184,304    | 1,355                          | 180                              | 268                  | 4,824    | 6                   | 4                                            | 0                                     | 0                   | 6            | 576         |
| XC6SLX25T  | 24,051               | 3,758                            | 30,064     | 229                            | 38                               | 52                   | 936      | 2                   | 2                                            | 1                                     | 2                   | 4            | 250         |
| XC6SLX45T  | 43,661               | 6,822                            | 54,576     | 401                            | 58                               | 116                  | 2,088    | 4                   | 2                                            | 1                                     | 4                   | 4            | 296         |
| XC6SLX75T  | 74,637               | 11,662                           | 93,296     | 692                            | 132                              | 172                  | 3,096    | 6                   | 4                                            | 1                                     | 8                   | 6            | 348         |
| XC6SLX100T | 101,261              | 15,822                           | 126,576    | 976                            | 180                              | 268                  | 4,824    | 6                   | 4                                            | 1                                     | 8                   | 6            | 498         |
| XC6SLX150T | 147,443              | 23,038                           | 184,304    | 1,355                          | 180                              | 268                  | 4,824    | 6                   | 4                                            | 1                                     | 8                   | 6            | 540         |

#### Digital Signal Processing—DSP48A1 Slice

DSP applications use many binary multipliers and accumulators, best implemented in dedicated DSP slices. All Spartan-6 FPGAs have many dedicated, full-custom, low-power DSP slices, combining high speed with small size, while retaining system design flexibility.

Each DSP48A1 slice consists of a dedicated  $18 \times 18$  bit two's complement multiplier and a 48-bit accumulator, both capable of operating at up to 390 MHz. The DSP48A1 slice provides extensive pipelining and extension capabilities that enhance speed and efficiency of many applications, even beyond digital signal processing, such as wide dynamic bus shifters, memory address generators, wide bus multiplexers, and memory-mapped I/O register files. The accumulator can also be used as a synchronous up/down counter. The multiplier can perform barrel shifting.

(1) According to the Xilinx website, the highest-end Xilinx FPGA at present is the Zynq UltraScale+ RFSoC ZU49DR. See its official overview document:

https://china.xilinx.com/support/documentation/data\_sheets/ds890-ultrascale-overview.pdf , which lists the resources of the Zynq UltraScale+ ZU49DR: 930,300 logic cells and 4,272 DSP slices.

#### Zynq UltraScale+ RFSoC: Device Feature Summary

Table 19: Zynq UltraScale+ RFSoC Feature Summary

| | | ZU21DR | ZU25DR | ZU27DR | ZU28DR | ZU29DR | ZU39DR | ZU42DR | ZU43DR | ZU46DR | ZU47DR | ZU48DR | ZU49DR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12-bit RF-ADC w/ DDC | # of ADCs | 0 | 8 | 8 | 8 | 16 | 16 | - | - | - | - | - | - |
| 12-bit RF-ADC w/ DDC | Max Rate (GSPS) | - | 4.096 | 4.096 | 4.096 | 2.058 | 2.220 | - | - | - | - | - | - |
| 14-bit RF-ADC w/ DDC | # of ADCs | - | - | - | - | - | - | 8 / 2 | 4 | 8 / 4 | 8 | 8 | 16 |
| 14-bit RF-ADC w/ DDC | Max Rate (GSPS) | - | - | - | - | - | - | 2.5 / 5.0 | 5.0 | 2.5 / 5.0 | 5.0 | 5.0 | 2.5 |
| 14-bit RF-DAC w/ DUC | # of DACs | 0 | 8 | 8 | 8 | 16 | 16 | 8 | 4 | 12 | 8 | 8 | 16 |
| 14-bit RF-DAC w/ DUC | Max Rate (GSPS) | - | 6.554 | 6.554 | 6.554 | 6.554 | 6.554 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| SD-FEC | | 8 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 8 | 0 | 8 | 0 |
| Application Processing Unit | | Quad-core Arm Cortex-A53 MPCore with CoreSight™; NEON and Single/Double Precision Floating Point; 32KB/32KB L1 Cache, 1MB L2 Cache | | | | | | | | | | | |
| Real-Time Processing Unit | | Dual-core Arm Cortex-R5F with CoreSight; Single/Double Precision Floating Point; 32KB/32KB L1 Cache, and TCM | | | | | | | | | | | |
| Embedded and External Memory | | 256KB On-Chip Memory w/ECC; External DDR4; DDR3; DDR3L; LPDDR4; LPDDR3; External Quad-SPI; NAND; eMMC | | | | | | | | | | | |
| General Connectivity | | 214 PS I/O; UART; CAN; USB 2.0; I2C; SPI; 32b GPIO; Real Time Clock; Watchdog Timers; Triple Timer Counters | | | | | | | | | | | |
| High-Speed Connectivity | | 4 PS-GTR; PCIe® Gen1/2; Serial ATA 3.1; DisplayPort 1.2a; USB 3.0; SGMII | | | | | | | | | | | |
| System Logic Cells | | 930,300 | 678,318 | 930,300 | 930,300 | 930,300 | 930,300 | 489,300 | 930,300 | 930,300 | 930,300 | 930,300 | 930,300 |
| CLB Flip-Flops | | 850,560 | 620,176 | 850,560 | 850,560 | 850,560 | 850,560 | 447,360 | 850,560 | 850,560 | 850,560 | 850,560 | 850,560 |
| CLB LUTs | | 425,280 | 310,088 | 425,280 | 425,280 | 425,280 | 425,280 | 223,680 | 425,280 | 425,280 | 425,280 | 425,280 | 425,280 |
| Distributed RAM (Mb) | | 13.0 | 9.6 | 13.0 | 13.0 | 13.0 | 13.0 | 6.8 | 13.0 | 13.0 | 13.0 | 13.0 | 13.0 |
| Block RAM Blocks | | 1,080 | 792 | 1,080 | 1,080 | 1,080 | 1,080 | 648 | 1,080 | 1,080 | 1,080 | 1,080 | 1,080 |
| Block RAM (Mb) | | 38.0 | 27.8 | 38.0 | 38.0 | 38.0 | 38.0 | 22.8 | 38.0 | 38.0 | 38.0 | 38.0 | 38.0 |
| UltraRAM Blocks | | 80 | 48 | 80 | 80 | 80 | 80 | 160 | 80 | 80 | 80 | 80 | 80 |
| UltraRAM (Mb) | | 22.5 | 13.5 | 22.5 | 22.5 | 22.5 | 22.5 | 45.0 | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 |
| DSP Slices | | 4,272 | 3,145 | 4,272 | 4,272 | 4,272 | 4,272 | 1,872 | 4,272 | 4,272 | 4,272 | 4,272 | 4,272 |
| CMTs | | 8 | 6 | 8 | 8 | 8 | 8 | 5 | 8 | 8 | 8 | 8 | 8 |
| Maximum HP I/O | | 208 | 299 | 299 | 299 | 312 | 312 | 128 | 299 | 312 | 299 | 299 | 312 |
| Maximum HD I/O | | 72 | 48 | 48 | 48 | 96 | 96 | 24 | 48 | 48 | 48 | 48 | 96 |
| System Monitor | | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GTY Transceivers | | 16 | 8 | 16 | 16 | 16 | 16 | 8 | 16 | 16 | 16 | 16 | 16 |
| Transceivers Fractional PLLs | | 8 | 4 | 8 | 8 | 8 | 8 | 4 | 8 | 8 | 8 | 8 | 8 |
| PCIE4 (PCIe Gen3 x16) | | 2 | 1 | 2 | 2 | 2 | 2 | - | - | - | - | - | - |
| PCIE4C (PCIe Gen3 x16 / Gen4 x8 / CCIX) <sup>(1)</sup> | | - | - | - | - | - | - | 0 | 2 | 2 | 2 | 2 | 2 |
| 150G Interlaken | | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 100G Ethernet w/ RS-FEC | | 2 | 1 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 2 |

- One DSP48-based adder needs 2 DSP slices plus 289 LUT-FF pairs.
- One logic-cell-based adder needs 517 logic cells.

Table 32: Characterization of Single-Precision Format on Kintex-7 FPGAs (Part = XC7K70T-1)

| Operation | Embedded Resources: Type | Embedded Resources: Number | Fabric Resources: LUT-FF Pairs | Fabric Resources: LUTs | Fabric Resources: FFs | Maximum Frequency (MHz), Kintex-7 -1 Speed Grade | | | |
|--------------|---------------------------------------|--------|-----------------|------|-----|-------------------|--|--|--|
| Multiply     | DSP48E1 (max usage)                   | 3      | 125             | 99   | 105 | 463               |  |  |  |
|              | DSP48E1 (full usage)                  | 2      | 170             | 117  | 160 | 463               |  |  |  |
|              | DSP48E1 (medium usage)                | 1      | 334             | 285  | 331 | 462               |  |  |  |
|              | Logic                                 | 0      | 666             | 629  | 669 | 449               |  |  |  |
| Add/Subtract | DSP48E1 (speed optimized, full usage) | 2      | 289             | 230  | 327 | 419               |  |  |  |
|              | Logic (speed optimized, no usage)     | 0      | 517             | 403  | 541 | 438               |  |  |  |
|              | Logic (low latency)                   | 0      | 587             | 482  | 610 | 7.2 7.464         |  |  |  |

Implementing the required I/O consumes some logic cells. Here we assume 14,000 are used, so the remaining logic cell count is 930300 - 14000 = 916300.

To obtain the maximum figure, we assume that all resources are used as fully as possible, which gives:

Number of DSP48-based adders = $4272 / 2 = 2136$

Number of logic-cell-based adders = $\lfloor (916300 - 2136 \times 289) / 517 \rfloor = 578$

The clock range of a DSP48-based adder is 600 MHz (slowest) to 891 MHz (fastest).

Notes:

1. This block operates in compatibility mode for 16.0GT/s (Gen4) operation. Go to PG213, UltraScale+ Devices Integrated Block for PCI Express Product Guide, for details on compatibility mode.

## **DSP48 Slice Switching Characteristics**

Table 83: DSP48 Slice Switching Characteristics

| Symbol | Description | -3 (V<sub>CCINT</sub> 0.90V) | -2 (0.85V) | -1 (0.85V) | -2 (0.72V) | -1 (0.72V) | Units |
|---|---|---|---|---|---|---|---|
| Maximum Frequency | | | | | | | |
| F<sub>MAX</sub> | With all registers used | 891 | 775 | 645 | 644 | 600 | MHz |
| F<sub>MAX_PATDET</sub> | With pattern detector | 794 | 687 | 571 | 562 | 524 | MHz |
| F<sub>MAX_MULT_NOMREG</sub> | Two register multiply without MREG | 635 | 544 | 456 | 440 | 413 | MHz |
| F<sub>MAX_MULT_NOMREG_PATDET</sub> | Two register multiply without MREG with pattern detect | 577 | 492 | 410 | 395 | 371 | MHz |
| F<sub>MAX_PREADD_NOADREG</sub> | Without ADREG | 655 | 565 | 468 | 453 | 423 | MHz |
| F<sub>MAX_NOPIPELINEREG</sub> | Without pipeline registers (MREG, ADREG) | | 410 | 338 | 323 | 304 | MHz |
| F<sub>MAX_NOPIPELINEREG_PATDET</sub> | Without pipeline registers (MREG, ADREG) with pattern detect | 448 | 379 | 314 | 299 | 280 | MHz |

#### Notes:

The clock range of a logic-cell-based adder is 667 MHz (slowest) to 891 MHz (fastest).

### **Clock Buffers and Networks**

Table 84: Clock Buffers Switching Characteristics

| Symbol | Description | -3 (V<sub>CCINT</sub> 0.90V) | -2 (0.85V) | -1 (0.85V) | -2 (0.72V) | -1 (0.72V) | Units |
|---|---|---|---|---|---|---|---|
| Global Clock Switching Characteristics (Including BUFGCTRL) | | | | | | | |
| F<sub>MAX</sub> | Maximum frequency of a global clock tree (BUFG) | 891 | 775 | 667 | 725 | 667 | MHz |
| Global Clock Buffer with Input Divide Capability (BUFGCE_DIV) | | | | | | | |
| F<sub>MAX</sub> | Maximum frequency of a global clock buffer with input divide capability (BUFGCE_DIV) | 891 | 775 | 667 | 725 | 667 | MHz |
| Global Clock Buffer with Clock Enable (BUFGCE) | | | | | | | |
| F<sub>MAX</sub> | Maximum frequency of a global clock buffer with clock enable (BUFGCE) | 891 | 775 | 667 | 725 | 667 | MHz |
| Leaf Clock Buffer with Clock Enable (BUFCE_LEAF) | | | | | | | |
| F<sub>MAX</sub> | Maximum frequency of a leaf clock buffer with clock enable (BUFCE_LEAF) | 891 | 775 | 667 | 725 | 667 | MHz |
| GTH or GTY Clock Buffer with Clock Enable and Clock Input Divide Capability (BUFG_GT) | | | | | | | |
| F<sub>MAX</sub> | Maximum frequency of a serial transceiver clock buffer with clock enable and clock input divide capability | 512 | 512 | 512 | | | MHz |

Therefore: (2136 + 578) × 891 MHz = 2418.174 GFLOPS
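As a cross-check, the following Python sketch recomputes the ZU49DR adder counts and evaluates the estimate at both ends of the quoted DSP48 clock range (600 MHz and 891 MHz). It then compares each against the 1024 GFLOPS figure given for the i7 6900K in the assignment. The helper names are hypothetical, adder counts are rounded down to whole adders, and using a single clock for both adder types is a simplifying assumption taken from the final expression above.

```python
I7_6900K_GFLOPS = 1024.0  # figure quoted in the assignment statement


def zu49dr_adders(dsp_slices=4272, logic_cells=930_300, io_cells=14_000):
    """Return (DSP48-based adders, logic-based adders) as whole numbers."""
    dsp_adders = dsp_slices // 2                  # 2 DSP slices per adder
    fabric_left = logic_cells - io_cells - dsp_adders * 289  # 289 LUT-FF pairs each
    logic_adders = fabric_left // 517             # 517 logic cells per adder
    return dsp_adders, logic_adders


def zu49dr_gflops(f_mhz):
    """Estimated GFLOPS with every adder clocked at f_mhz (assumption)."""
    dsp_adders, logic_adders = zu49dr_adders()
    return (dsp_adders + logic_adders) * f_mhz / 1000.0


for label, f_mhz in (("slowest speed grade", 600.0), ("fastest speed grade", 891.0)):
    gflops = zu49dr_gflops(f_mhz)
    ratio = gflops / I7_6900K_GFLOPS
    print(f"{label}: {gflops:.3f} GFLOPS ({ratio:.2f}x the i7 6900K)")
```

Even at the slowest speed grade the estimate stays above the i7 6900K figure, so the conclusion does not hinge on picking the 891 MHz upper bound.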

1. For devices operating at the lower power V<sub>CCINT</sub> = 0.72V voltages, DSP cascades that cross the clock region center might operate below the specified F<sub>MAX</sub>.





## Assignment 1



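The single-precision floating-point throughput of the Intel i7 6900K is 1024 GFLOPS, and that of the 6700K is 511.3 GFLOPS.

- (1) Estimate how many GFLOPS a low-end Xilinx Spartan-6 series FPGA can reach when implementing parallel single-precision floating-point arithmetic. Compared with the i7, which has the stronger floating-point capability?
- (2) Which is currently the highest-end Xilinx FPGA? Estimate the floating-point performance of that chip.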

